> > Currently, costs of moves are also used for costs of RTL expressions.  This
> > patch:
> >
> > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> >
> > includes:
> >
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index e943d13..8409a5f 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> >    {4, 4, 4},                      /* cost of loading integer registers
> >                                       in QImode, HImode and SImode.
> >                                       Relative to reg-reg move (2).  */
> > -  {6, 6, 6},                      /* cost of storing integer registers */
> > +  {6, 6, 3},                      /* cost of storing integer registers */
> >    2,                              /* cost of reg,reg fld/fst */
> >    {6, 6, 8},                      /* cost of loading fp registers
> >                                       in SFmode, DFmode and XFmode */
Well, it seems that the patch was fixing things in the wrong spot - the
tables are intended to be mostly latency based.  I think we ought to
document divergences from this, including the benchmarks where the change
helped.  Otherwise it is very hard to figure out why an entry does not
match reality.

> > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > register move.  It caused a regression:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> >
> > Since the cost for SImode store is also used to compute scalar_store
> > in ix86_builtin_vectorization_cost, it changed loop costs in
> >
> > void
> > foo (long p2, long *diag, long d, long i)
> > {
> >   long k;
> >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> >   while (i < k)
> >     diag[i++] = d;
> > }
> >
> > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > instead of 3.
> >
> > My patch separates costs of moves from costs of RTL expressions.  We have
> > a follow-up patch which restores the cost for SImode store back to 6 and
> > leaves the cost of scalar_store unchanged.  It keeps loop unrolling
> > unchanged and improves powf performance in glibc by 30%.  We are
> > collecting SPEC CPU 2017 data now.

I have seen the problem with scalar_store with AMD tuning as well.  It
seems to make the SLP vectorizer happy about the idea of turning a
sequence of, say, integer stores into code which moves all the values
into an AVX register and then does one vector store.  The cost model
basically compares the cost of N scalar stores to one vector store plus
the vector construction, and the vector construction is costed as
N * sse_op + addss.

With this testcase:

short array[8];
void
test (short a, short b, short c, short d, short e, short f, short g, short h)
{
  array[0] = a;
  array[1] = b;
  array[2] = c;
  array[3] = d;
  array[4] = e;
  array[5] = f;
  array[6] = g;
  array[7] = h;
}

int iarray[8];
void
test2 (int a, int b, int c, int d, int e, int f, int g, int h)
{
  iarray[0] = a;
  iarray[1] = b;
  iarray[2] = c;
  iarray[3] = d;
  iarray[4] = e;
  iarray[5] = f;
  iarray[6] = g;
  iarray[7] = h;
}

I get the following codegen:

test:
	vmovd	%edi, %xmm0
	vmovd	%edx, %xmm2
	vmovd	%r8d, %xmm1
	vmovd	8(%rsp), %xmm3
	vpinsrw	$1, 16(%rsp), %xmm3, %xmm3
	vpinsrw	$1, %esi, %xmm0, %xmm0
	vpinsrw	$1, %ecx, %xmm2, %xmm2
	vpinsrw	$1, %r9d, %xmm1, %xmm1
	vpunpckldq	%xmm2, %xmm0, %xmm0
	vpunpckldq	%xmm3, %xmm1, %xmm1
	vpunpcklqdq	%xmm1, %xmm0, %xmm0
	vmovaps	%xmm0, array(%rip)
	ret

test2:
	vmovd	%r8d, %xmm5
	vmovd	%edx, %xmm6
	vmovd	%edi, %xmm7
	vpinsrd	$1, %r9d, %xmm5, %xmm1
	vpinsrd	$1, %ecx, %xmm6, %xmm3
	vpinsrd	$1, %esi, %xmm7, %xmm0
	vpunpcklqdq	%xmm3, %xmm0, %xmm0
	vmovd	16(%rbp), %xmm4
	vpinsrd	$1, 24(%rbp), %xmm4, %xmm2
	vpunpcklqdq	%xmm2, %xmm1, %xmm1
	vinserti128	$0x1, %xmm1, %ymm0, %ymm0
	vmovdqu	%ymm0, iarray(%rip)
	vzeroupper
	ret

which is about 20% slower on my Skylake notebook than the
non-SLP-vectorized variant.  I wonder if the vec_construct costs should
be made more realistic.  It is computed as:

      case vec_construct:
	{
	  /* N element inserts into SSE vectors.  */
	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
	  if (GET_MODE_BITSIZE (mode) == 256)
	    cost += ix86_vec_cost (mode, ix86_cost->addss);
	  /* One vinserti64x4 and two vinserti128 for combining SSE
	     and AVX256 vectors to AVX512.  */
	  else if (GET_MODE_BITSIZE (mode) == 512)
	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
	  return cost;
	}

So it expects 8 simple SSE operations plus one SSE FP arithmetic
operation, while the code above has 8 inter-unit moves plus 3 SSE
integer operations to shuffle things around.
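Just to make that comparison concrete, here is a rough back-of-the-envelope
sketch of the accounting the model does for the test2 case above.  All of
the cost values in it are made-up placeholders, not the actual skylake_cost
entries, and it ignores the ix86_vec_cost scaling, so only the shape of the
comparison matters:

/* Rough model of the SLP store-group comparison for the V8SI case (test2).
   Every cost value here is a placeholder chosen for illustration only,
   NOT a real skylake_cost entry.  */
#include <stdio.h>

static const int sse_op = 1;        /* simple SSE op (placeholder) */
static const int addss = 3;         /* SSE FP add, reused for vinserti128 (placeholder) */
static const int scalar_store = 3;  /* one scalar store (placeholder) */
static const int vector_store = 3;  /* one vector store (placeholder) */

int
main (void)
{
  int nunits = 8;   /* TYPE_VECTOR_SUBPARTS for V8SI */

  /* vec_construct as the code above computes it for a 256-bit vector:
     N element inserts plus one vinserti128.  */
  int construct = nunits * sse_op + addss;

  int scalar_side = nunits * scalar_store;     /* 8 plain integer stores */
  int vector_side = construct + vector_store;  /* build the vector + 1 store */

  printf ("scalar: %d  vectorized: %d\n", scalar_side, vector_side);
  return 0;
}

With placeholder numbers anything like these the vectorized side wins
easily, yet none of the eight GPR->XMM moves that the generated test2
code actually needs show up anywhere in that comparison.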
Not to mention the increased register pressure.  I would say that for
integer vector construction it is the common case that values need to be
moved from the integer unit to the SSE unit.

Overall the problem is deeper, since the vectorizer may really need a
better idea of latencies and throughputs to estimate loop times more
realistically.  One may also want to account somewhat for the fact that
stores are often not part of the hot path and thus their latency is not
too critical, and, on the other hand, for the fact that vector stores
prevent later partial memory stalls...

Honza