On Thu, Jun 20, 2019 at 2:10 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > > Currently, costs of moves are also used for costs of RTL expressions.
> > > This patch:
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > >
> > > includes:
> > >
> > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > index e943d13..8409a5f 100644
> > > --- a/gcc/config/i386/x86-tune-costs.h
> > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > >    {4, 4, 4},            /* cost of loading integer registers
> > >                             in QImode, HImode and SImode.
> > >                             Relative to reg-reg move (2).  */
> > > -  {6, 6, 6},            /* cost of storing integer registers */
> > > +  {6, 6, 3},            /* cost of storing integer registers */
> > >    2,                    /* cost of reg,reg fld/fst */
> > >    {6, 6, 8},            /* cost of loading fp registers
> > >                             in SFmode, DFmode and XFmode */
>
> Well, it seems that the patch was fixing things in the wrong spot - the
> tables are intended to be mostly latency based.  I think we ought to
> document divergences from these, including benchmarks where the change
> helped.  Otherwise it is very hard to figure out why an entry does not
> match reality.
>
> > > It lowered the cost for SImode store and made it cheaper than an
> > > SSE<->integer register move.  It caused a regression:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > >
> > > Since the cost for SImode store is also used to compute scalar_store
> > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > >
> > > void
> > > foo (long p2, long *diag, long d, long i)
> > > {
> > >   long k;
> > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > >   while (i < k)
> > >     diag[i++] = d;
> > > }
> > >
> > > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > instead of 3.
> > >
> > > My patch separates costs of moves from costs of RTL expressions.  We have
> > > a follow-up patch which restores the cost for SImode store back to 6 and
> > > leaves the cost of scalar_store unchanged.  It keeps loop unrolling
> > > unchanged and improves powf performance in glibc by 30%.  We are
> > > collecting SPEC CPU 2017 data now.
>
> I have seen the problem with scalar_store with AMD tuning as well.
> It seems to make the SLP vectorizer happy about the idea of turning a
> sequence of, say, integer stores into code which moves all the values
> into an AVX register and then does one vector store.
>
> The cost model basically compares the cost of N scalar stores to
> 1 vector store + vector construction.  Vector construction is then
> N*sse_op + addss.
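
As a purely illustrative aside, that comparison can be written out as
straight-line C.  This is not the actual vectorizer code; the *_cost
parameters below merely stand in for the cost-table entries named above:

/* Hedged sketch, not the real cost model: SLP profitability for a
   group of N scalar stores, per the comparison described above.  */
static int
slp_store_group_cheaper_p (int n, int scalar_store_cost,
                           int vector_store_cost,
                           int sse_op_cost, int addss_cost)
{
  int scalar_cost = n * scalar_store_cost;              /* N scalar stores.  */
  int vec_construct = n * sse_op_cost + addss_cost;     /* element inserts + combine.  */
  int vector_cost = vector_store_cost + vec_construct;  /* 1 vector store + construction.  */
  return vector_cost < scalar_cost;
}

If vec_construct underestimates the real construction work (for example
the inter-unit moves visible in the codegen below), the vector side
looks artificially cheap.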
>
> With testcase:
>
> short array[8];
> test (short a,short b,short c,short d,short e,short f,short g,short h)
> {
>   array[0]=a;
>   array[1]=b;
>   array[2]=c;
>   array[3]=d;
>   array[4]=e;
>   array[5]=f;
>   array[6]=g;
>   array[7]=h;
> }
> int iarray[8];
> test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> {
>   iarray[0]=a;
>   iarray[1]=b;
>   iarray[2]=c;
>   iarray[3]=d;
>   iarray[4]=e;
>   iarray[5]=f;
>   iarray[6]=g;
>   iarray[7]=h;
> }
>
> I get the following codegen:
>
> test:
>         vmovd   %edi, %xmm0
>         vmovd   %edx, %xmm2
>         vmovd   %r8d, %xmm1
>         vmovd   8(%rsp), %xmm3
>         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
>         vpinsrw $1, %esi, %xmm0, %xmm0
>         vpinsrw $1, %ecx, %xmm2, %xmm2
>         vpinsrw $1, %r9d, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm0, %xmm0
>         vpunpckldq      %xmm3, %xmm1, %xmm1
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
>         vmovaps %xmm0, array(%rip)
>         ret
>
> test2:
>         vmovd   %r8d, %xmm5
>         vmovd   %edx, %xmm6
>         vmovd   %edi, %xmm7
>         vpinsrd $1, %r9d, %xmm5, %xmm1
>         vpinsrd $1, %ecx, %xmm6, %xmm3
>         vpinsrd $1, %esi, %xmm7, %xmm0
>         vpunpcklqdq     %xmm3, %xmm0, %xmm0
>         vmovd   16(%rbp), %xmm4
>         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
>         vpunpcklqdq     %xmm2, %xmm1, %xmm1
>         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
>         vmovdqu %ymm0, iarray(%rip)
>         vzeroupper
>         ret
>
> which is about 20% slower on my skylake notebook than the
> non-SLP-vectorized variant.
>
> I wonder if the vec_construct costs should be made more realistic.
> It is computed as:
>
>       case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
>           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
>           /* One vinserti64x4 and two vinserti128 for combining SSE
>              and AVX256 vectors to AVX512.  */
>           else if (GET_MODE_BITSIZE (mode) == 512)
>             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
>           return cost;
>
> So it expects 8 simple SSE operations + one SSE FP arithmetic
> operation, while the code above has 8 inter-unit moves + 3 SSE integer
> operations to shuffle things around, not to mention the increased
> register pressure.
>
> I would say that for integer constructs it is the common case that
> things need to be moved from the integer unit to SSE.
>
> Overall the problem is deeper, since the vectorizer really may need a
> better idea of latencies and throughputs to estimate loop times more
> realistically.
>
> One may also want to account somewhat for the fact that stores are
> often not part of the hot path and thus their latency is not too
> critical, and, on the other hand, for the fact that vector stores
> prevent later partial memory stalls...
>
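If it helps the discussion, one way vec_construct could account for
those inter-unit moves might look roughly like the sketch below.  This
is only an illustration, and the integer_to_sse cost field it references
is an assumption for the sake of the example, not a pointer to an
existing struct member:

      case vec_construct:
        {
          /* N element inserts into SSE vectors, as today.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* Sketch: integer elements usually start life in GPRs, so
             additionally charge one inter-unit move per element.
             "integer_to_sse" is a hypothetical cost entry here.  */
          if (!FLOAT_TYPE_P (TREE_TYPE (vectype)))
            cost += TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->integer_to_sse;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;
        }

Something along these lines would charge the short/int constructs above
for the GPR->XMM transfers they actually need, instead of pretending the
elements already live in SSE registers.
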
I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952

We shouldn't use the costs of moves as the costs of RTL expressions.  We
can experiment with different RTL expression cost formulas, but we need
to separate the costs of RTL expressions from the costs of moves first.

What is the best way to partition processor_costs to avoid confusion
between costs of moves vs. costs of RTL expressions?  One possible shape
is sketched below.

--
H.J.
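
For illustration only, here is a rough sketch of one way such a split
could look.  The field names below are assumptions chosen to mirror the
processor_costs entries mentioned in this thread, not a committed design:

/* Sketch only: keep the latency-based move costs used by register
   allocation in their own sub-struct, and give rtx_costs /
   ix86_builtin_vectorization_cost an independent set of entries, so
   tuning one no longer perturbs the other.  */
struct processor_costs
{
  /* Costs of moves between hard registers and memory; latency based,
     consumed by register allocation and reload.  */
  struct
  {
    int int_load[3];            /* QImode, HImode, SImode loads.  */
    int int_store[3];           /* QImode, HImode, SImode stores.  */
    int sse_load[5];            /* SSE loads by width.  */
    int sse_store[5];           /* SSE stores by width.  */
    int sse_to_integer;         /* XMM -> GPR move.  */
    int integer_to_sse;         /* GPR -> XMM move.  */
    /* ... remaining move costs ...  */
  } hard_register;

  /* Costs used for RTL expressions and the vectorizer cost hooks
     (e.g. scalar_store); free to diverge from the move costs above.  */
  int int_load[3];
  int int_store[3];
  int sse_load[5];
  int sse_store[5];
  int sse_op;
  int addss;
  /* ... remaining expression costs ...  */
};

With a split like this, the register allocator could keep seeing the
latency-based SImode store cost of 6 while the vectorizer's scalar_store
is tuned independently.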