[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2017-01-28 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #14 from Andrew Pinski  ---
(In reply to Maxim Kuvyrkov from comment #12) 
> You are making an orthogonal point to this bug report: whether or not to
> vectorize such a loop.  But if loop is vectorized, then on any
> microarchitecture it is better to have "st2" vs "umov; st1; str".

Yes, but thinking about the problem some more, I do think there is a vector
cost model issue in the aarch64 backend: we don't model int vs floating
point cost differences.  For example, ^ for scalar int might be one cycle but
for vector it is 4 cycles, whereas floating point scalar addition is 4 cycles
and floating point vector addition is also just 4 cycles.

struct cpu_vector_cost
{
  const int scalar_stmt_cost;  /* Cost of any scalar operation,
                                  excluding load and store.  */
...

  const int vec_stmt_cost;     /* Cost of any vector operation,
                                  excluding load, store, permute,
                                  vector-to-scalar and
                                  scalar-to-vector operation.  */
...
};

Anyways I filed PR 79262 for the regression.
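
To make the int vs floating point point concrete, here is a rough sketch of
what a split cost table could look like.  The struct and function names are
hypothetical and are not part of GCC's struct cpu_vector_cost; the cycle
counts are the ones from the example above, and loads, stores and loop
overhead are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost table, NOT GCC's struct cpu_vector_cost: the single
   scalar and vector statement costs are split into int and FP variants.  */
struct split_vector_cost
{
  int scalar_int_stmt_cost;
  int scalar_fp_stmt_cost;
  int vec_int_stmt_cost;
  int vec_fp_stmt_cost;
};

/* Cycle counts from the example above: scalar int 1, scalar FP 4,
   vector int 4, vector FP 4.  */
static const struct split_vector_cost example_cost = { 1, 4, 4, 4 };

/* Return true if one vector statement is cheaper than the VF scalar
   statements it replaces.  */
static bool
vector_stmt_profitable (const struct split_vector_cost *c, bool is_fp, int vf)
{
  assert (vf > 0);
  int scalar = is_fp ? c->scalar_fp_stmt_cost : c->scalar_int_stmt_cost;
  int vector = is_fp ? c->vec_fp_stmt_cost : c->vec_int_stmt_cost;
  return vector < scalar * vf;
}
```

With a vectorization factor of 2 (two 64-bit lanes), the integer ^ comes out
as not profitable (4 vs 2 cycles) while the FP addition does (4 vs 8), which
a single scalar_stmt_cost/vec_stmt_cost pair cannot express.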

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2016-12-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #13 from Richard Biener  ---
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code.  I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
> 
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?

It has the complication that we can't actually store to the gaps because
that creates store data races (and we'd need a load-modify-write cycle).

So we have to emit either scalar stores (which is what we currently do),
emit masked stores (not implemented yet) or something you suggest
(I suppose that's a store-lanes kind?).

A slight complication is that we have to avoid detecting the store group
if we want to end up with scalar stores (well, that's a vectorizer
implementation limit).  This is why we simply split all groups at gap
boundaries.  Cost-based selection of the kind of store (or even load)
implementation is not implemented.
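
A small single-threaded sketch of the store data race concern, under some
simplifying assumptions: the gap field is modelled as a 64-bit integer, and a
callback (hypothetical, named bump_gap here) stands in for another thread that
writes the gap between our load and our store.

```c
/* Simplified stand-in for node_struct: the 8-byte gap is modelled as an
   integer so that it is easy to inspect.  */
struct node
{
  unsigned long long gap;
  unsigned long long state;
};

/* Hook standing in for another thread running between load and store.  */
typedef void (*interleave_fn) (struct node *);

/* Scalar store: only state is written, the gap is never touched.  */
static void
update_scalar_store (struct node *n, unsigned long long mask,
                     interleave_fn other)
{
  unsigned long long s = n->state ^ mask;
  other (n);
  n->state = s;
}

/* Whole-element store: load both fields, modify state, write both back.
   This is the load-modify-write cycle a gap-filling store would need.  */
static void
update_full_store (struct node *n, unsigned long long mask,
                   interleave_fn other)
{
  struct node tmp = *n;
  tmp.state ^= mask;
  other (n);
  *n = tmp;  /* Overwrites the gap with the stale value.  */
}

/* Hypothetical concurrent writer that only touches the gap.  */
static void
bump_gap (struct node *n)
{
  n->gap += 1;
}
```

Starting from gap == 10, update_scalar_store leaves gap == 11 after bump_gap
runs, while update_full_store puts it back to 10: the other writer's update
is lost even though the source program never stores to the gap.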

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2016-12-12 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #12 from Maxim Kuvyrkov  ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> > .L4:
> > ld2 {v0.2d - v1.2d}, [x1]
> > add w2, w2, 1
> > cmp w2, w7
> > eor v0.16b, v2.16b, v0.16b
> > umov x4, v0.d[1]
> > st1 {v0.d}[0], [x1]
> > add x1, x1, 32
> > str x4, [x1, -16]
> > bcc .L4
> 
> 
> What I did for thunderx was create a vector cost model which caused this
> loop not to be vectorized, to keep the regression from happening.  Note this
> might actually be better code for some micro arch. I need to check with the new
> processor we have in house but that is next week or so.  I don't know how
> much I can share next week though.

You are making an orthogonal point to this bug report: whether or not to
vectorize such a loop.  But if loop is vectorized, then on any
microarchitecture it is better to have "st2" vs "umov; st1; str".

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2016-12-12 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #11 from Andrew Pinski  ---
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code.  I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
> 
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?
> 
> The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
> which is #2 function in 462.libquantum profile.  This cycle accounts for
> about 25% of total 462.libquantum time.
> 
> ===
> struct node_struct
> {
>   float _Complex gap;
>   unsigned long long state;
> };
> 
> struct reg_struct
> {
>   int size;
>   struct node_struct *node;
> };
> 
> void
> func(int target, struct reg_struct *reg)
> {
>   int i;
> 
>   for(i=0; i<reg->size; i++)
>     reg->node[i].state ^= ((unsigned long long) 1 << target);
> }
> ===
> 
> This loop vectorizes into
>   :
>   # vectp.8_39 = PHI 
>   vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
>   vect__5.11_41 = vect_array.10[0];
>   vect__5.12_42 = vect_array.10[1];
>   vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
>   _48 = BIT_FIELD_REF ;
>   MEM[(long long unsigned int *)ivtmp_45] = _48;
>   ivtmp_50 = ivtmp_45 + 16;
>   _51 = BIT_FIELD_REF ;
>   MEM[(long long unsigned int *)ivtmp_50] = _51;
> 
> which then becomes for aarch64:
> .L4:
>   ld2 {v0.2d - v1.2d}, [x1]
>   add w2, w2, 1
>   cmp w2, w7
>   eor v0.16b, v2.16b, v0.16b
>   umov x4, v0.d[1]
>   st1 {v0.d}[0], [x1]
>   add x1, x1, 32
>   str x4, [x1, -16]
>   bcc .L4


What I did for thunderx was create a vector cost model which caused this loop
not to be vectorized, to keep the regression from happening.  Note this might
actually be better code for some micro arch.  I need to check with the new
processor we have in house but that is next week or so.  I don't know how much
I can share next week though.

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2016-12-12 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #10 from Maxim Kuvyrkov  ---
(In reply to Maxim Kuvyrkov from comment #9)
> which then becomes for aarch64:
> .L4:
>   ld2 {v0.2d - v1.2d}, [x1]
>   add w2, w2, 1
>   cmp w2, w7
>   eor v0.16b, v2.16b, v0.16b
>   umov x4, v0.d[1]
>   st1 {v0.d}[0], [x1]
>   add x1, x1, 32
>   str x4, [x1, -16]
>   bcc .L4

IIUC,
umov x4, v0.d[1]
st1 {v0.d}[0], [x1]
str x4, [x1, -16]
could become just
st2 {v0.2d - v1.2d}, [x1]
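
For readers less familiar with the lane instructions, here is a portable C
emulation of the ld2 / eor / st2 sequence.  It is only a sketch: node[] is
modelled as a flat array of 64-bit words (gap0, state0, gap1, state1, ...),
the helper names are made up, and, as in the asm above, each vector iteration
starts at a state word, so it also reads and rewrites the gap of the
following node.

```c
#include <stddef.h>

typedef unsigned long long u64;

/* Emulates "ld2 {v0.2d - v1.2d}, [p]": de-interleave four 64-bit words
   into two 2-lane vectors.  */
static void
ld2_emul (const u64 *p, u64 v0[2], u64 v1[2])
{
  v0[0] = p[0]; v1[0] = p[1];
  v0[1] = p[2]; v1[1] = p[3];
}

/* Emulates "st2 {v0.2d - v1.2d}, [p]": re-interleave and store.  */
static void
st2_emul (u64 *p, const u64 v0[2], const u64 v1[2])
{
  p[0] = v0[0]; p[1] = v1[0];
  p[2] = v0[1]; p[3] = v1[1];
}

/* XOR mask into every state word.  words[2*i] is the gap of node i and
   words[2*i + 1] is its state.  */
static void
xor_states_st2 (u64 *words, size_t n_nodes, u64 mask)
{
  size_t i = 0;

  /* A vector iteration touches words 2i+1 .. 2i+4, so node i+2 must exist.  */
  for (; i + 2 < n_nodes; i += 2)
    {
      u64 *p = &words[2 * i + 1];
      u64 v0[2], v1[2];
      ld2_emul (p, v0, v1);
      v0[0] ^= mask;  /* eor on the lanes holding the two state words.  */
      v0[1] ^= mask;
      st2_emul (p, v0, v1);  /* Also rewrites the two gap words in v1.  */
    }

  /* Scalar epilogue for the remaining nodes.  */
  for (; i < n_nodes; i++)
    words[2 * i + 1] ^= mask;
}
```

The result matches the scalar loop when nothing else writes the array, but
st2 stores the unmodified gap words back as well.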

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2016-12-12 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Maxim Kuvyrkov  changed:

   What|Removed |Added

 CC||mkuvyrkov at gcc dot gnu.org

--- Comment #9 from Maxim Kuvyrkov  ---
I've looked into another case where inability to handle stores with gaps
generates sub-optimal code.  I'm interested in spending some time on fixing
this, provided some guidance in the vectorizer.

Is it substantially more difficult to handle stores with gaps compared to loads
with gaps?

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
which is #2 function in 462.libquantum profile.  This cycle accounts for about
25% of total 462.libquantum time.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into
  :
  # vectp.8_39 = PHI 
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF ;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF ;
  MEM[(long long unsigned int *)ivtmp_50] = _51;

which then becomes for aarch64:
.L4:
ld2 {v0.2d - v1.2d}, [x1]
add w2, w2, 1
cmp w2, w7
eor v0.16b, v2.16b, v0.16b
umov x4, v0.d[1]
st1 {v0.d}[0], [x1]
add x1, x1, 32
str x4, [x1, -16]
bcc .L4

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2013-03-27 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #8 from Richard Biener rguenth at gcc dot gnu.org 2013-03-27 
11:27:31 UTC ---
The issue is that we cannot use a vector v4sf store to opoints[i][0]
as opoints[i][3] is not stored to.  Such masked store (or interleaved
store with gaps) is not supported by SLP.

[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2012-07-13 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Blocks||53947

--- Comment #7 from Richard Guenther rguenth at gcc dot gnu.org 2012-07-13 
08:43:04 UTC ---
Link to vectorizer missed-optimization meta-bug.


[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2011-05-22 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Steven Bosscher steven at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed|2006-09-19 07:10:15 |2011-05-22 17:40:15

--- Comment #6 from Steven Bosscher steven at gcc dot gnu.org 2011-05-22 
15:40:28 UTC ---
Still not vectorized in recent GCC 
t.c:20: note: not vectorized: complicated access pattern.
t.c:22: note: not vectorized: complicated access pattern.


 1  typedef unsigned int bool;
 2  #define true 1
 3
 4  #define NUMPOINTS 5
 5
 6  #define align(x) __attribute__((align(x)))
 7
 8  typedef float align(16) MATRIX[3][3];
 9
10  static float points[NUMPOINTS][4];
11  static align(16) float opoints[NUMPOINTS][4];
12  static bool flags[NUMPOINTS];
13  static MATRIX gmatrix;
14
15
16  void RotateVectors (void)
17  {
18    int i, r;
19
20    for (r = 0; r < 4; r++)
21    {
22      for (i = 0; i < NUMPOINTS; i++)
23      {
24        opoints[i][0] = gmatrix[0][0] * points[i][0]
25                      + gmatrix[0][1] * points[i][1]
26                      + gmatrix[0][2] * points[i][2];
27        opoints[i][1] = gmatrix[1][0] * points[i][0]
28                      + gmatrix[1][1] * points[i][1]
29                      + gmatrix[1][2] * points[i][2];
30        opoints[i][2] = gmatrix[2][0] * points[i][0]
31                      + gmatrix[2][1] * points[i][1]
32                      + gmatrix[2][2] * points[i][2];
33        flags[i] = true;
34      }
35    }
36  }

GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]


[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2007-01-06 Thread irar at il dot ibm dot com


--- Comment #5 from irar at il dot ibm dot com  2007-01-07 07:40 ---
On the todo list.

BTW, vectorization of strided accesses was committed to the mainline 4.3.

Ira


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438



[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2007-01-04 Thread giovannibajo at libero dot it


--- Comment #4 from giovannibajo at libero dot it  2007-01-05 00:37 ---
Thanks Ira. What about store with gaps?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438



[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2006-09-19 Thread irar at il dot ibm dot com


--- Comment #3 from irar at il dot ibm dot com  2006-09-19 07:10 ---
> t.c:20: note: not vectorized: mixed data-types
> t.c:20: note: can't determine vectorization factor.
> 
> Removing flags[i] = true;

Multiple data-types vectorization is already supported in the autovect branch,
and the patches for mainline (starting from
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00941.html) will be committed as
soon as 4.3 is open.  


> we get:
> t.c:20: note: not consecutive access
> t.c:20: note: not vectorized: complicated access pattern.

Vectorization of strided accesses is also already implemented in the autovect
branch (and will be committed to the mainline 4.3). However, this case contains
stores with gaps (stores to opoints[i][0], opoints[i][1], and opoints[i][2],
without a store to opoints[i][3]), and only loads with gaps are currently
supported.

Therefore, this loop will be vectorizable in the autovect branch (and soon in
the mainline 4.3) if a store to opoints[i][3] is added.
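
As a sketch of that last point, the inner loop with the store group completed
might look like the following.  The function name and the value stored to
opoints[i][3] (0.0f) are assumptions made for illustration; the flags[i]
store and the outer r loop are left out to keep the example focused on the
store group.

```c
#define NUMPOINTS 5

static float points[NUMPOINTS][4];
static float opoints[NUMPOINTS][4];
static float gmatrix[3][3];

/* Variant of the inner loop in which all four elements of opoints[i] are
   written, so the group of stores has no gap.  */
void
RotateVectorsNoGap (void)
{
  int i;

  for (i = 0; i < NUMPOINTS; i++)
    {
      opoints[i][0] = gmatrix[0][0] * points[i][0]
                    + gmatrix[0][1] * points[i][1]
                    + gmatrix[0][2] * points[i][2];
      opoints[i][1] = gmatrix[1][0] * points[i][0]
                    + gmatrix[1][1] * points[i][1]
                    + gmatrix[1][2] * points[i][2];
      opoints[i][2] = gmatrix[2][0] * points[i][0]
                    + gmatrix[2][1] * points[i][1]
                    + gmatrix[2][2] * points[i][2];
      /* Added store; the value is an assumed filler.  */
      opoints[i][3] = 0.0f;
    }
}
```

Note that this changes what the program writes to opoints[i][3], so it is a
source-level workaround rather than something the compiler could do on its
own.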

Ira


-- 

irar at il dot ibm dot com changed:

   What|Removed |Added

 CC||irar at il dot ibm dot com
   Last reconfirmed|2005-12-21 03:49:03 |2006-09-19 07:10:15
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438



[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2005-09-20 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2005-09-20 
17:47 ---
t.c:20: note: not vectorized: mixed data-types
t.c:20: note: can't determine vectorization factor.

Removing flags[i] = true;
we get:
t.c:20: note: not consecutive access
t.c:20: note: not vectorized: complicated access pattern.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438


[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication

2004-11-11 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-12 
02:43 ---
Confirmed.  ICC can do this but does not, because it is not very efficient to
do it.

-- 
   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever Confirmed||1
   Last reconfirmed|-00-00 00:00:00 |2004-11-12 02:43:35
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438