[Bug rtl-optimization/85017] New: Missed constant propagation to index of array reference

2018-03-21 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85017

Bug ID: 85017
   Summary: Missed constant propagation to index of array
reference
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: sergey.shalnov at intel dot com
  Target Milestone: ---

I found an extra instruction generated on the x86_64 platform. I have no
performance data to prove a performance gap, but other compilers generate more
optimal code from my point of view.

Command line (the same with -O2 and -O3):
gcc -c -O1 test.c


The source code:

char arr[400];

char foo(int c)
{
  int i = c + 20;
  return arr[i];
}

Main trunk GCC generates the following:
  addl    $20, %edi
  movslq  %edi, %rdi
  movzbl  arr(%rdi), %eax
  ret

but other compilers (example with clang):
  movslq  %edi, %rax
  movb    arr+20(%rax), %al
  retq

It would be better to propagate the constant $20 into the array load instead
of adding it to the register explicitly.

Could someone please advise me on a workaround or a patch? I will provide
performance data for this.
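
In the meantime, a source-level rewrite that seems to get the folded
addressing (a sketch of mine, untested for performance; foo2 is a
hypothetical name). It assumes, as the original code already does, that
c + 20 does not overflow and stays within the array bounds:

char foo2(int c)
{
  /* Index from a shifted base so the +20 can become part of the
     addressing mode, as in the clang output above.  */
  return (arr + 20)[c];
}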

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-09 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

sergey.shalnov at intel dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #38 from sergey.shalnov at intel dot com ---
Implemented

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #36 from sergey.shalnov at intel dot com ---
The patch that fixes the issue for SKX is in
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html

I will close the PR after the patch has been merged.

Thank you very much to all involved.
Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-29 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #33 from sergey.shalnov at intel dot com ---
Richard,
I'm not sure whether it is a regression or not. I see the code was visibly
refactored in this commit
https://github.com/gcc-mirror/gcc/commit/ee6e9ba576099aed29f1097195c649fc796ecf5e
back in 2013.

However, this fix is highly important for our workloads and really desirable
for GCC 8.
For example, your patch plus the SKX cost model changes give us ~+1% geomean
on SPEC CPU 2017 intrate.

It would be really good for us to get this patch into GCC 8.

The cost model changes are planned as follows (they will be proposed separately):

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e943d13..d5e6ef6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},            /* cost of loading integer registers
                            in QImode, HImode and SImode.
                            Relative to reg-reg move (2).  */
-  {6, 6, 6},            /* cost of storing integer registers */
+  {6, 6, 3},            /* cost of storing integer registers */
   2,                    /* cost of reg,reg fld/fst */
   {6, 6, 8},            /* cost of loading fp registers
                            in SFmode, DFmode and XFmode */
@@ -1572,7 +1572,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},    /* cost of loading SSE registers
                            in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},    /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},     /* cost of storing SSE registers
+  {8, 8, 8, 14, 24},    /* cost of storing SSE registers
                            in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},     /* cost of unaligned stores.  */
   2, 2,                 /* SSE->integer and integer->SSE moves */

I know you have some concerns about the costs above; I'm saving that for a
follow-up discussion, because your patch is a good foundation for proceeding
with cost tuning.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-26 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #31 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. It is exactly what I discussed in this issue.
I tested it with SPEC CPU 20[06|17] and see no performance or stability
degradation.

Could you please merge your latest patch into main trunk?

I will provide the Intel-specific changes (in gcc/config/i386) that will
leverage your patch to improve performance for the provided testcase
(step 2).

Thank you
Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-19 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #29 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. I would like to clarify the use of the
multiple_p() function in the if() clause.

First of all, I assume that architectures with a fixed hardware register
size (x86) should take the if(){} branch, while architectures with unknown
register size should take the else{} branch; this is what the is_constant()
check is for.

The problem is in the multiple_p() call. In the test case we have
group_size = 16 and const_nunits = 8, so multiple_p(16, 8) returns 1, and
with -march=skylake-avx512 (or -march=znver1) we take the else{} branch,
which is not correct.
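
For constant sizes, multiple_p() boils down to a divisibility test. A
minimal model of my understanding (a sketch only, not GCC's real poly_int
implementation):

#include <stdbool.h>

/* Sketch: multiple_p (a, b) asks whether a is a multiple of b.  */
static bool multiple_p (unsigned a, unsigned b)
{
  return a % b == 0;
}

/* group_size = 16, const_nunits = 8: multiple_p (16, 8) is true, so
   "! multiple_p (group_size, const_nunits)" is false and the else{}
   branch is taken.  With group_size = 17, multiple_p (17, 8) is false.  */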

I used the following change in your patch, and the test works as I expect:
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1923,7 +1923,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
   unsigned HOST_WIDE_INT const_nunits;
   unsigned nelt_limit;
   if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
-      && ! multiple_p (group_size, const_nunits))
+      && multiple_p (group_size, const_nunits))
     {
       num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
       nelt_limit = const_nunits;
-- 

Another question is what we should do if group_size is, for example, 17.
In that case (after my change) the wrong else{} branch would be taken.

I would propose to change this particular if() in the following way:

  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
    {
      if (multiple_p (group_size, const_nunits))
        {
          num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
          nelt_limit = const_nunits;
        }
      else
        {
          ...not clear what we should have here...
        }
    }
  else
    {
      num_vects_to_check = 1;
      nelt_limit = group_size;
    }

Or perhaps multiple_p() should not be here at all?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-17 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #28 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your comments.
I see that TYPE_VECTOR_SUBPARTS is constant for the test case, but
multiple_p (group_size, const_nunits) returns 1 in this code:
  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
      && ! multiple_p (group_size, const_nunits))
    {
      num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
      nelt_limit = const_nunits;
    }
  else
    {
      num_vects_to_check = 1;
      nelt_limit = group_size;
    }
This is because group_size = 16 and const_nunits = 8 in the test case with
-march=skylake-avx512, so control flow goes to "num_vects_to_check = 1".
Even in the other branch, with ncopies_for_cost = 2 the expression
"ncopies_for_cost * const_nunits / group_size" would still evaluate to 1.
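
Spelling the numbers out (my own check, plain C just for illustration):

/* Test-case values with -march=skylake-avx512: */
unsigned group_size = 16, const_nunits = 8, ncopies_for_cost = 2;
unsigned num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
/* 2 * 8 / 16 = 1, so even the if(){} branch would give 1 here;
   getting 2 would require num_vects_to_check = ncopies_for_cost.  */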

Do you think there is any way to write the condition so that we get
num_vects_to_check = 2?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #26 from sergey.shalnov at intel dot com ---
Sorry, did you mean "arm_sve.h" on ARM?
In that case we have machine-specific code in a common part of the GCC code.
Should we turn this into a machine-dependent callback function, given that
"num_vects_to_check = 1" is incorrect for SSE?

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #24 from sergey.shalnov at intel dot com ---
Richard,
The latest "SLP costing for constants/externs improvement" patch generates the
same code as baseline for the test example.

Are you sure that "num_vects_to_check" should 1 if vector is not constant? I
would expect "num_vects_to_check = ncopies_for_cost;" here:

1863      else
1864        {
1865          num_vects_to_check = 1;
1866          nelt_limit = group_size;
1867        }

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #21 from sergey.shalnov at intel dot com ---
Thanks, Richard, for your comments.
Based on our discussion I've produced the attached patch and ran it on
SPEC 2017 intrate/fprate on a Skylake server (with [-Ofast -flto
-march=skylake-avx512 -mfpmath=sse -funroll-loops]).
Please note that I used your last proposed patch and changed the loop trip
count calculation ("ncopies_for_cost * nunits / group_size" is always 1).

I see the following performance changes:
SPEC CPU 2017 intrate
500.perlbench_r -0.7%
525.x264_r +7.2%
Geomean: +0.8%

SPEC CPU 2017 fprate
527.cam4_r -1.1%
538.imagick_r +4.7%
544.nab_r +3.6%
Geomean: +0.6%

I believe that after appropriate cost model tweaks a gain could be observed
on other targets as well, but I haven't checked that carefully.
The patch provides a good performance gain for the original case and a few
other improvements.

Could you please take a look at the patch and comment (or perhaps propose
another method)?

Sergey

From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
From: Sergey Shalnov <sergey.shal...@intel.com>
Date: Tue, 9 Jan 2018 14:37:14 +0100
Subject: [PATCH, SLP] SLP_common_algo_changes

---
 gcc/config/i386/x86-tune-costs.h |  4 ++--
 gcc/tree-vect-slp.c  | 41 ++--
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 312467d..3e0f904 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},            /* cost of loading integer registers
                            in QImode, HImode and SImode.
                            Relative to reg-reg move (2).  */
-  {6, 6, 6},            /* cost of storing integer registers */
+  {6, 6, 4},            /* cost of storing integer registers.  */
   2,                    /* cost of reg,reg fld/fst */
   {6, 6, 8},            /* cost of loading fp registers
                            in SFmode, DFmode and XFmode */
@@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},    /* cost of loading SSE registers
                            in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},    /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},     /* cost of storing SSE registers
+  {8, 8, 8, 16, 32},    /* cost of storing SSE registers
                            in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},     /* cost of unaligned stores.  */
   2, 2,                 /* SSE->integer and integer->SSE moves */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 0ca42b4..7e63a1c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
       enum vect_def_type dt;
       if (!op || op == lhs)
         continue;
-      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+          && (dt == vect_constant_def || dt == vect_external_def))
         {
           /* Without looking at the actual initializer a vector of
              constants can be implemented as load from the constant pool.
-             ???  We need to pass down stmt_info for a vector type
-             even if it points to the wrong stmt.  */
-          if (dt == vect_constant_def)
-            record_stmt_cost (prologue_cost_vec, 1, vector_load,
-                              stmt_info, 0, vect_prologue);
-          else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
-                              stmt_info, 0, vect_prologue);
+             When all elements are the same we can use a splat.  */
+          unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+          tree elt = NULL_TREE;
+          unsigned nelt = 0;
+          for (unsigned j = 0; j < ncopies_for_cost; ++j)
+            for (unsigned k = 0; k < group_size; ++k)
+              {
+                if (nelt == 0)
+                  elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+                /* ???  We're just tracking whether all operands of a single
+                   vector initializer are the same, ideally we'd check if
+                   we emitted the same one already.  */
+                else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+                                           i))
+                  elt = NULL_TREE;
+                nelt++;
+                if (nelt == group_size)
+                  {
+                    /* ???  We need to pass down stmt_info for

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-02 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #20 from sergey.shalnov at intel dot com ---
Richard,
I did a quick static analysis of your latest patch.
With the command line "-g -Ofast -mfpmath=sse -funroll-loops -march=znver1",
your latest patch doesn't affect the issue I discussed, but it does affect
the costs for the first loop.
I thought the loop costs were calculated elsewhere (tree-vect-loop.c), but
as I can see, everything is interconnected.

The SLP block we discussed keeps the same statistics:
  Vector inside of basic block cost: 64
  Vector prologue cost: 32
  Vector epilogue cost: 0
  Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

The first loop was:
note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 48
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

Became:
note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 192
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 192
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-24 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #18 from sergey.shalnov at intel dot com ---
Yes, I agree that vector_store stage has it’s own vectorization cost.
And each vector_store has vector_construction stage. These stages are different
in gcc slp (as you know).
To better illustrate my point of view I would like to propose a patch.
I didn’t submit the patch to community yet because I think it is better to
discuss it before that.

index 0ca42b4..72e3123 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
             record_stmt_cost (prologue_cost_vec, 1, vector_load,
                               stmt_info, 0, vect_prologue);
           else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
+            record_stmt_cost (prologue_cost_vec, ncopies_for_cost, vec_construct,
                               stmt_info, 0, vect_prologue);
         }
     }

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #16 from sergey.shalnov at intel dot com ---
"it's one vec_construct operation - it's the task of the target to turn this
into a cost comparable to vector_store"
I agree that the vec_construct operation cost is based on the target cost model.

I think I don't completely understand why we have 1 vec_construct for 3
vector_stores.
I would expect a vec_construct for each vector_store operation.
Looking at the asm, I see that each "vmovaps %xmm, (%rsp)" has its own
construction sequence

(in this asm I extracted only one vector_store from the set of 3
vector_stores generated)
 vec_construct:
 2d3:   c5 79 6e dd vmovd  %ebp,%xmm11
 310:   c4 41 79 6e d0  vmovd  %r8d,%xmm10
 2e0:   c4 63 21 22 e2 01   vpinsrd $0x1,%edx,%xmm11,%xmm12
 31a:   c4 43 29 22 e9 01   vpinsrd $0x1,%r9d,%xmm10,%xmm13
 329:   c4 41 11 6c f4  vpunpcklqdq %xmm12,%xmm13,%xmm14
vector_store:
 340:   c5 78 29 74 24 d8   vmovaps %xmm14,-0x28(%rsp)

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #14 from sergey.shalnov at intel dot com ---
" we have a basic-block vectorizer.  Do you propose to remove it? "
Definitely not! SLP vectorizer is very good to have!

“What's the rationale for not using vector registers”
I just tried " -fno-tree-slp-vectorize" option and found the performance gain
for different -march= options.

I see some misunderstanding here. Let me clarify the original question with
–march=znver1.
I use " -Ofast -mfpmath=sse -funroll-loops -march=znver1" options set for
experiments.

For the basic block we are discussing we have (in vect_analyze_slp_cost() in
tree-vect-slp.c:1897):

tmp[i_220][0] = _150;
tmp[i_220][2] = _147;
tmp[i_220][1] = _144;
tmp[i_220][3] = _141;

tmp[i_139][0] = _447;
tmp[i_139][2] = _450;
tmp[i_139][1] = _453;
tmp[i_139][3] = _456;

tmp[i_458][0] = _54;
tmp[i_458][2] = _56;
tmp[i_458][1] = _58;
tmp[i_458][3] = _60;

this is si->stmt as printed in the loop that does the "vect_prologue"
calculation.

I see the SLP statistics for this BB:
note: Cost model analysis:
  Vector inside of basic block cost: 64 
  Vector prologue cost: 32 
  Vector epilogue cost: 0 
  Scalar cost of basic block: 256 
note: Basic block will be vectorized using SLP

I see 12 statements that are computed as 3 vector instructions with 4 data
elements each (4 x int -> xmm):
group_size = 12
ncopies_for_cost = 3
nunits = 4
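
Spelling out the arithmetic (my own check, as a plain C sketch):

/* 12 scalar stores packed 4 ints per xmm: */
unsigned group_size = 12, nunits = 4;
unsigned nstores = group_size / nunits;   /* = 3 vector_store operations */
/* body_cost_vec below indeed has count = 3, so I would expect one
   vec_construct per store in prologue_cost_vec, i.e. count = 3.  */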

But I see "count" is 1 in cost vector related to prolog.
prologue_cost_vec = {m_vec = 0x3fc6e70 = {{count = 1, kind = vec_construct,
stmt = , misalign = 0}}}
body_cost_vec = {m_vec = 0x3fc6f70 = {{count = 3, kind = vector_store, stmt =
, misalign = 0}}}

Please correct me if I'm wrong, but I think we have to have count = 3 in
prologue_cost_vec.
That would slightly change the "Vector prologue cost" and might influence
the vectorizer's decision.

Sergey
PS:
Richard,
I didn't catch your idea in the "but DOM isn't powerful enough" sentence.
Could you please clarify it a bit?
Thank you.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #12 from sergey.shalnov at intel dot com ---
Richard,
Your last proposal changed the generated code a bit.
Currently it shows:
test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 62576 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 62576
divided by the scalar iteration cost = 328 is greater or equal to the
vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be
profitable.

And it uses xmm + vpbroadcastd to spill tmp[] to the stack:
...
1e7:   62 d2 7d 08 7c c9   vpbroadcastd %r9d,%xmm1
 1ed:   c4 c1 79 7e c9  vmovd  %xmm1,%r9d
 1f2:   62 f1 fd 08 7f 8c 24vmovdqa64 %xmm1,-0x38(%rsp)
 1f9:   c8 ff ff ff 
 1fd:   62 f2 7d 08 7c d7   vpbroadcastd %edi,%xmm2
 203:   c5 f9 7e d7 vmovd  %xmm2,%edi
 207:   62 f1 fd 08 7f 94 24vmovdqa64 %xmm2,-0x28(%rsp)
 20e:   d8 ff ff ff 
 212:   62 f2 7d 08 7c db   vpbroadcastd %ebx,%xmm3
 218:   c5 f9 7e de vmovd  %xmm3,%esi
 21c:   62 f1 fd 08 7f 9c 24vmovdqa64 %xmm3,-0x18(%rsp)
 223:   e8 ff ff ff 
 227:   01 fe   add%edi,%esi
 229:   45 01 c8add%r9d,%r8d
 22c:   41 01 f0add%esi,%r8d
 22f:   8b 5c 24 dc mov-0x24(%rsp),%ebx
 233:   03 5c 24 ec add-0x14(%rsp),%ebx
 237:   8b 6c 24 bc mov-0x44(%rsp),%ebp
 23b:   03 6c 24 cc add-0x34(%rsp),%ebp
...

I think this is better from a performance perspective but, as I said
before, not using vector registers here at all is the best option if no
loop is vectorized.

In the case of a static loop increment (the first test case), the first
loop is vectorized as before.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #11 from sergey.shalnov at intel dot com ---
Richard,
"Is this about the "stupid" attempt to use as little AVX512 as possible"
No, it is not.
I provided the asm listing at the beginning with zmm only to illustrate the
issue more clearly. I used the "-mprefer-vector-width=512" option to get
that asm output.
I believe that if we remove these registers (any of them, including ymm and
xmm) from the BB you mentioned, it will be profitable for all architectures.

" Sergey, can you test this?"
Yes. Going to try this.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #9 from sergey.shalnov at intel dot com ---
Created attachment 42813
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42813&action=edit
New reproducer

Slightly changed first loop

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #8 from sergey.shalnov at intel dot com ---
Richard,
These are great changes, and I see the first loop become vectorized for the
test example I provided with GCC 8.0 main trunk.
But I think the issue is a bit more complicated: vectorization of the first
loop just hides the issue I reported.

Currently I see the following:
test loop: "for (int i = 0; i < 4; i++, input1 += 4, input2 += 4)"
test_bugzilla.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 1136 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
  Calculated minimum iters for profitability: 0 
test_bugzilla.c:6:5: note:   Runtime profitability threshold = 4 
test_bugzilla.c:6:5: note:   Static estimate profitability threshold = 4 
test_bugzilla.c:6:5: note: loop vectorized

If I slightly change the loop (to be closer to the real application):
"for (int i = 0; i < 4; i++, input1 += stride1, input2 += stride2)"
test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 5232 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 5232
divided by the scalar iteration cost = 328 is greater or equal to the
vectorization factor = 4. 
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable. 
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be
profitable.

And the issue with the extra vector operations remains the same.

I'm not sure, but I think it is really profitable to avoid vector register
usage if the loop is not vectorized.
Do you agree?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #6 from sergey.shalnov at intel dot com ---
I found the issue report related to the vectorization issues in the second
loop (reduction uint->int).

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #5 from sergey.shalnov at intel dot com ---
(In reply to Richard Biener from comment #2)
> The strange code is because we perform basic-block vectorization resulting in
> 
>   vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417,
> _417, _417, _417, _48, _48, _48, _48};
>   MEM[(unsigned int *)&tmp] = vect_cst__249;
>   _186 = tmp[0][0];
>   _185 = tmp[1][0];
> ...
> 
> which for some reason is deemed profitable:
> 
> t.c:32:12: note: Cost model analysis:
>   Vector inside of basic block cost: 24
>   Vector prologue cost: 64
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 192
> t.c:32:12: note: Basic block will be vectorized using SLP
> 
> what is odd is that the single vector store is costed 24 while the 16 scalar
> int stores are costed 192.  The vector build from scalar costs 64.
> 
> I guess Honzas cost-model tweaks might have gone wrong here or we're hitting
> an oddity in the SLP costing.
> 
> Even if it looks strange maybe the sequence _is_ profitable?
> 
> The second loop would be vectorized if 'sum' was unsigned.

Richard,
No, the sequence is not profitable. If we don't use any vector registers
here, the performance will be better on all architectures.
I'm talking only about the vectorized code here.

I'm trying to look into the vect_stmt_relevant_p() function to implement
additional limitations and avoid basic-block vectorization if the loop is
not vectorized.

If you have any idea how to avoid vectorization in this particular place,
please let me know.

Sergey

[Bug tree-optimization/65930] Reduction with sign-change not handled

2017-12-05 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930

sergey.shalnov at intel dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------
                 CC|                            |sergey.shalnov at intel dot com

--- Comment #4 from sergey.shalnov at intel dot com ---
Richard,
I can confirm that the issue exists and that it can be solved by making the
data types of the accumulator and of the expression equal.
As I understand it, you propose to introduce an intermediate internal
accumulator of uint32 type and store it into the (int) accumulator later.
Please correct me if I'm wrong.
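
If I understood that correctly, it amounts to something like this at the
source level (a sketch of my understanding only; test_reduction is a
hypothetical example, not the actual testcase):

int test_reduction (const unsigned int *a, int n)
{
  unsigned int usum = 0;       /* intermediate accumulator in uint32 */
  for (int i = 0; i < n; i++)
    usum += a[i];              /* the reduction stays fully unsigned */
  return (int) usum;           /* one sign conversion after the loop */
}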

Anyway, may I propose a slightly simpler way to solve the issue?
In the GIMPLE statements we have (at the place of the reduction,
tree-vect-loop.c:2877):

I believe the issue disappears if we change it to:
"sum_12 = _6;"

I'm not sure but, if I remember correctly, the C standard makes this
(uint->int) conversion implementation-defined when the value doesn't fit
(and GCC defines it to wrap). In that case the compiler could do this type
cast (possibly under some command line switch).

What do you think, could it help?
Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #1 from sergey.shalnov at intel dot com ---
Created attachment 42616
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42616&action=edit
reproducer

[Bug target/83008] New: [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

Bug ID: 83008
   Summary: [performance] Is it better to avoid extra instructions
in data passing between loops?
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: sergey.shalnov at intel dot com
  Target Milestone: ---

I found strange code generated by GCC 8.0/7.x with the following command
line options:
-g -Ofast -march=skylake-avx512 -ftree-vectorize

Two loops are not vectorized.
The first one is not vectorized because:
test.c:6:23: note: cost model: the vector iteration cost = 1488 divided by the
scalar iteration cost = 328 is greater or equal to the vectorization factor =
4.
test.c:6:23: note: not vectorized: vectorization not profitable.
test.c:6:23: note: not vectorized: vector version will never be profitable.

The second one is not vectorized because:
test.c:20:23: note: step unknown.
test.c:20:23: note: reduction: not commutative/associative: sum_87 = (int) _61;
test.c:20:23: note: Unknown def-use cycle pattern.
test.c:20:23: note: Unsupported pattern.
test.c:20:23: note: not vectorized: unsupported use in stmt.
test.c:20:23: note: unexpected pattern.

If we look into the asm we find a strange method of passing data in the
"tmp" array between the loops:
…loop 1 body…
 13f:   41 8d 04 13 lea(%r11,%rdx,1),%eax
 143:   c5 f9 6e d8 vmovd  %eax,%xmm3
 147:   c5 e1 62 db vpunpckldq %xmm3,%xmm3,%xmm3
 14b:   c5 f9 62 c0 vpunpckldq %xmm0,%xmm0,%xmm0
 14f:   c5 f1 62 c9 vpunpckldq %xmm1,%xmm1,%xmm1
 153:   c5 e9 62 d2 vpunpckldq %xmm2,%xmm2,%xmm2
 157:   c5 e9 6c d2 vpunpcklqdq %xmm2,%xmm2,%xmm2
 15b:   c5 f1 6c c9 vpunpcklqdq %xmm1,%xmm1,%xmm1
 15f:   c5 f9 6c c0 vpunpcklqdq %xmm0,%xmm0,%xmm0
 163:   c5 e1 6c db vpunpcklqdq %xmm3,%xmm3,%xmm3
 167:   c4 e3 6d 38 c9 01   vinserti128 $0x1,%xmm1,%ymm2,%ymm1
 16d:   c4 e3 7d 38 c3 01   vinserti128 $0x1,%xmm3,%ymm0,%ymm0
 173:   62 f3 f5 48 3a c0 01vinserti64x4 $0x1,%ymm0,%zmm1,%zmm0
 17a:   62 f1 fd 48 7f 44 24vmovdqa64 %zmm0,-0x40(%rsp)
 181:   ff 
 182:   8b 54 24 e0 mov-0x20(%rsp),%edx
 186:   03 54 24 f0 add-0x10(%rsp),%edx
…loop 2 body…

If I'm not mistaken, the algorithm looks like the following:
1.  Run the first loop and keep the values in GPRs
2.  Move these GPRs to XMMs
3.  Pack these XMMs into YMMs
4.  Pack these YMMs into a ZMM
5.  Spill the ZMM onto the stack
6.  Load the values from the stack into GPRs for the second loop

Wouldn't it be better, from a performance perspective, to pass the values
from the first loop directly to the second loop in GPRs (without all these
vector registers)?

The reproducer is:
 1  int test(unsigned char * input1, unsigned char * input2)
 2  {
 3  unsigned int tmp[4][4];
 4  unsigned int var0, var1, var2, var3;
 5  int sum = 0;
 6  for (int i = 0; i < 4; i++, input1 += 4, input2 += 4) {
 7  var0 = (input1[0] + input2[0]) + (input1[4] + input2[4]);
 8  var1 = (input1[1] + input2[1]) + (input1[5] + input2[5]);
 9  var2 = (input1[2] + input2[2]) + (input1[6] + input2[6]);
10  var3 = (input1[3] + input2[3]) + (input1[7] + input2[7]);
11  int inter0 = var0 + var1;
12  int inter1 = var0 + var1;
13  int inter2 = var2 + var3;
14  int inter3 = var2 + var3;
15  tmp[i][0] = inter0 + inter2;
16  tmp[i][2] = inter0 + inter2;
17  tmp[i][1] = inter1 + inter3;
18  tmp[i][3] = inter1 + inter3;
19  }
20  for (int i = 0; i < 4; i++) {
21  int inter0 = tmp[0][i] + tmp[1][i];
22  int inter1 = tmp[0][i] + tmp[1][i];
23  int inter2 = tmp[2][i] + tmp[3][i];
24  int inter3 = tmp[2][i] + tmp[3][i];
25  var0 = inter0 + inter2;
26  var2 = inter0 + inter2;
27  var1 = inter1 + inter3;
28  var3 = inter1 + inter3;
29  sum += var0 + var1 + var2 + var3;
30  }
31
32  return sum;
33  }

Sergey