[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #14 from Andrew Pinski ---
(In reply to Maxim Kuvyrkov from comment #12)
> You are making an orthogonal point to this bug report: whether or not to
> vectorize such a loop. But if the loop is vectorized, then on any
> microarchitecture it is better to have "st2" vs "umov; st1; str".

Yes, but thinking about the problem some more, I do think there is a vector
cost model issue in the aarch64 backend: we don't model int vs floating point
cost differences. For example, ^ for scalar int might be one cycle while the
vector form is 4 cycles, but floating point scalar addition is 4 cycles while
floating point vector addition is also just 4 cycles.

struct cpu_vector_cost
{
  const int scalar_stmt_cost;  /* Cost of any scalar operation,
                                  excluding load and store.  */
  ...
  const int vec_stmt_cost;     /* Cost of any vector operation,
                                  excluding load, store, permute,
                                  vector-to-scalar and
                                  scalar-to-vector operation.  */

Anyways I filed PR 79262 for the regression.
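To illustrate the cost-model point, here is a minimal sketch in C. It is not GCC's code: the struct and function names (`single_cost`, `split_cost`, `profitable_single`, `profitable_split`) are hypothetical, the profitability test is deliberately simplified to "one vector statement versus VF scalar statements", and the cycle counts are the example numbers from the comment above.

```c
#include <assert.h>

/* Hypothetical cost tables, for illustration only.  */
struct single_cost
{
  int scalar_stmt_cost;   /* one number for every scalar operation */
  int vec_stmt_cost;      /* one number for every vector operation */
};

struct split_cost
{
  int scalar_int_stmt_cost;
  int scalar_fp_stmt_cost;
  int vec_int_stmt_cost;
  int vec_fp_stmt_cost;
};

enum op_kind { OP_INT, OP_FP };

/* Nonzero if one vector statement is estimated to be cheaper than
   VF scalar statements.  The kind of operation cannot be considered
   because the table has no room for it.  */
static int
profitable_single (const struct single_cost *c, int vf)
{
  assert (vf > 0);
  return c->vec_stmt_cost < vf * c->scalar_stmt_cost;
}

/* Same estimate, but with separate integer and floating point costs.  */
static int
profitable_split (const struct split_cost *c, enum op_kind kind, int vf)
{
  assert (vf > 0);
  if (kind == OP_INT)
    return c->vec_int_stmt_cost < vf * c->scalar_int_stmt_cost;
  return c->vec_fp_stmt_cost < vf * c->scalar_fp_stmt_cost;
}
```

With a vectorization factor of 2 and the example costs (int: 1 scalar, 4 vector; fp: 4 scalar, 4 vector), the split table rejects the integer case and accepts the floating point case, while a single pair of numbers has to give the same answer for both.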
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #13 from Richard Biener ---
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code. I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
>
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?

It has the complication that we can't actually store to the gaps because
that creates store data races (and we'd need a load-modify-write cycle).
So we have to emit either scalar stores (which is what we currently do),
emit masked stores (not implemented yet) or something you suggest
(I suppose that's a store-lanes kind?).

A slight complication is that we have to avoid detecting the store group
if we want to end up with scalar stores (well, that's a vectorizer
implementation limit). This is why we simply split all groups at gap
boundaries.

Cost-based selection of the kind of store (or even load) implementation
is not implemented.
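The store data race can be sketched in portable, single-threaded C. Everything below is an illustration under stated assumptions: the node array is viewed as flat 64-bit words with the gap at even indices and `state` at odd indices, the "other thread" is simulated by a hypothetical hook called between the load and the store, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hook that stands in for a concurrent writer; I is the word index
   of the pair currently being processed.  */
typedef void (*hook_fn) (uint64_t *words, size_t i);

/* Scalar stores: only the state words (odd indices) are written.  */
static void
xor_scalar (uint64_t *words, size_t nwords, uint64_t mask)
{
  for (size_t i = 1; i < nwords; i += 2)
    words[i] ^= mask;
}

/* Full-width load-modify-write: the gap word is stored back too,
   so an update made to it in between is lost.  */
static void
xor_full_width (uint64_t *words, size_t nwords, uint64_t mask,
                hook_fn between)
{
  assert (nwords % 2 == 0);
  for (size_t i = 0; i + 1 < nwords; i += 2)
    {
      uint64_t v[2] = { words[i], words[i + 1] };  /* gap, state */
      v[1] ^= mask;
      if (between)
        between (words, i);
      words[i] = v[0];       /* stale gap value written back */
      words[i + 1] = v[1];
    }
}

/* Emulated masked store: only lanes marked in STORE_LANE are written.  */
static void
xor_masked (uint64_t *words, size_t nwords, uint64_t mask,
            hook_fn between)
{
  static const int store_lane[2] = { 0, 1 };
  assert (nwords % 2 == 0);
  for (size_t i = 0; i + 1 < nwords; i += 2)
    {
      uint64_t v[2] = { words[i], words[i + 1] };
      v[1] ^= mask;
      if (between)
        between (words, i);
      for (int lane = 0; lane < 2; lane++)
        if (store_lane[lane])
          words[i + lane] = v[lane];
    }
}

/* Simulated concurrent writer: updates the first gap word once.  */
static void
writer (uint64_t *words, size_t i)
{
  if (i == 0)
    words[0] = 0xAAAA;
}
```

Running `xor_full_width` with `writer` as the hook leaves the old gap value in `words[0]`, while `xor_masked` and `xor_scalar` never touch it. That lost update is what a plain vector store over the whole group would risk in a real multi-threaded program.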
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #12 from Maxim Kuvyrkov ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> > .L4:
> >         ld2     {v0.2d - v1.2d}, [x1]
> >         add     w2, w2, 1
> >         cmp     w2, w7
> >         eor     v0.16b, v2.16b, v0.16b
> >         umov    x4, v0.d[1]
> >         st1     {v0.d}[0], [x1]
> >         add     x1, x1, 32
> >         str     x4, [x1, -16]
> >         bcc     .L4
>
> What I did for thunderx was create a vector cost model which caused this
> loop not to be vectorized, to keep the regression from happening. Note this
> might actually be better code for some micro arch. I need to check with the
> new processor we have in house but that is next week or so. I don't know how
> much I can share next week though.

You are making an orthogonal point to this bug report: whether or not to
vectorize such a loop. But if the loop is vectorized, then on any
microarchitecture it is better to have "st2" vs "umov; st1; str".
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #11 from Andrew Pinski ---
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code. I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
>
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?
>
> The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
> which is the #2 function in the 462.libquantum profile. This loop accounts
> for about 25% of total 462.libquantum time.
>
> ===
> struct node_struct
> {
>   float _Complex gap;
>   unsigned long long state;
> };
>
> struct reg_struct
> {
>   int size;
>   struct node_struct *node;
> };
>
> void
> func(int target, struct reg_struct *reg)
> {
>   int i;
>
>   for(i=0; i<reg->size; i++)
>     reg->node[i].state ^= ((unsigned long long) 1 << target);
> }
> ===
>
> This loop vectorizes into
>   <bb ...>:
>   # vectp.8_39 = PHI <...>
>   vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
>   vect__5.11_41 = vect_array.10[0];
>   vect__5.12_42 = vect_array.10[1];
>   vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
>   _48 = BIT_FIELD_REF <...>;
>   MEM[(long long unsigned int *)ivtmp_45] = _48;
>   ivtmp_50 = ivtmp_45 + 16;
>   _51 = BIT_FIELD_REF <...>;
>   MEM[(long long unsigned int *)ivtmp_50] = _51;
>
> which then becomes for aarch64:
> .L4:
>         ld2     {v0.2d - v1.2d}, [x1]
>         add     w2, w2, 1
>         cmp     w2, w7
>         eor     v0.16b, v2.16b, v0.16b
>         umov    x4, v0.d[1]
>         st1     {v0.d}[0], [x1]
>         add     x1, x1, 32
>         str     x4, [x1, -16]
>         bcc     .L4

What I did for thunderx was create a vector cost model which caused this
loop not to be vectorized, to keep the regression from happening. Note this
might actually be better code for some micro arch. I need to check with the
new processor we have in house but that is next week or so. I don't know how
much I can share next week though.
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #10 from Maxim Kuvyrkov ---
(In reply to Maxim Kuvyrkov from comment #9)
> which then becomes for aarch64:
> .L4:
>         ld2     {v0.2d - v1.2d}, [x1]
>         add     w2, w2, 1
>         cmp     w2, w7
>         eor     v0.16b, v2.16b, v0.16b
>         umov    x4, v0.d[1]
>         st1     {v0.d}[0], [x1]
>         add     x1, x1, 32
>         str     x4, [x1, -16]
>         bcc     .L4

IIUC,
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        str     x4, [x1, -16]
could become just
        st2     {v0.2d - v1.2d}, [x1]
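The equivalence suggested above can be checked with a small portable emulation. This is only a sketch: `v2d`, `ld2_2d`, `st2_2d` and the two `loop_body_*` functions are hypothetical stand-ins written for this example, not NEON intrinsics, and the memory is modelled as four consecutive 64-bit words starting at `x1`.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 128-bit vector register holding two 64-bit lanes.  */
typedef struct { uint64_t d[2]; } v2d;

/* Emulates ld2 {va.2d - vb.2d}, [p]: de-interleave four words.  */
static void
ld2_2d (const uint64_t *p, v2d *va, v2d *vb)
{
  assert (p && va && vb);
  va->d[0] = p[0];
  vb->d[0] = p[1];
  va->d[1] = p[2];
  vb->d[1] = p[3];
}

/* Emulates st2 {va.2d - vb.2d}, [p]: re-interleave four words.  */
static void
st2_2d (uint64_t *p, const v2d *va, const v2d *vb)
{
  assert (p && va && vb);
  p[0] = va->d[0];
  p[1] = vb->d[0];
  p[2] = va->d[1];
  p[3] = vb->d[1];
}

/* One iteration as currently generated: umov; st1 (lane 0); str.  */
static void
loop_body_current (uint64_t *p, uint64_t mask)
{
  v2d v0, v1;
  ld2_2d (p, &v0, &v1);
  v0.d[0] ^= mask;              /* eor v0.16b, v2.16b, v0.16b */
  v0.d[1] ^= mask;
  uint64_t x4 = v0.d[1];        /* umov x4, v0.d[1] */
  p[0] = v0.d[0];               /* st1 {v0.d}[0], [x1] */
  p[2] = x4;                    /* str x4, [x1, -16] after x1 += 32 */
}

/* One iteration using a single interleaving store instead.  */
static void
loop_body_st2 (uint64_t *p, uint64_t mask)
{
  v2d v0, v1;
  ld2_2d (p, &v0, &v1);
  v0.d[0] ^= mask;
  v0.d[1] ^= mask;
  st2_2d (p, &v0, &v1);         /* also rewrites p[1] and p[3] */
}
```

In a single thread both bodies leave memory in the same state. The `st2` form also writes back the two interleaved words it loaded into `v1`, which is the store-to-gap concern raised in comment #13.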
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Maxim Kuvyrkov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mkuvyrkov at gcc dot gnu.org

--- Comment #9 from Maxim Kuvyrkov ---
I've looked into another case where inability to handle stores with gaps
generates sub-optimal code. I'm interested in spending some time on fixing
this, provided some guidance in the vectorizer.

Is it substantially more difficult to handle stores with gaps compared to
loads with gaps?

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
which is the #2 function in the 462.libquantum profile. This loop accounts
for about 25% of total 462.libquantum time.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into
  <bb ...>:
  # vectp.8_39 = PHI <...>
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF <...>;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF <...>;
  MEM[(long long unsigned int *)ivtmp_50] = _51;

which then becomes for aarch64:
.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     w2, w2, 1
        cmp     w2, w7
        eor     v0.16b, v2.16b, v0.16b
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        str     x4, [x1, -16]
        bcc     .L4
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> 2013-03-27 11:27:31 UTC ---
The issue is that we cannot use a vector v4sf store to opoints[i][0] as
opoints[i][3] is not stored to. Such a masked store (or interleaved store
with gaps) is not supported by SLP.
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-13 08:43:04 UTC ---
Link to vectorizer missed-optimization meta-bug.
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2006-09-19 07:10:15         |2011-05-22 17:40:15

--- Comment #6 from Steven Bosscher <steven at gcc dot gnu.org> 2011-05-22 15:40:28 UTC ---
Still not vectorized in recent GCC

t.c:20: note: not vectorized: complicated access pattern.
t.c:22: note: not vectorized: complicated access pattern.

     1  typedef unsigned int bool;
     2  #define true 1
     3
     4  #define NUMPOINTS 5
     5
     6  #define align(x) __attribute__((align(x)))
     7
     8  typedef float align(16) MATRIX[3][3];
     9
    10  static float points[NUMPOINTS][4];
    11  static align(16) float opoints[NUMPOINTS][4];
    12  static bool flags[NUMPOINTS];
    13  static MATRIX gmatrix;
    14
    15
    16  void RotateVectors (void)
    17  {
    18    int i, r;
    19
    20    for (r = 0; r < 4; r++)
    21    {
    22      for (i = 0; i < NUMPOINTS; i++)
    23      {
    24        opoints[i][0] = gmatrix[0][0] * points[i][0]
    25                      + gmatrix[0][1] * points[i][1]
    26                      + gmatrix[0][2] * points[i][2];
    27        opoints[i][1] = gmatrix[1][0] * points[i][0]
    28                      + gmatrix[1][1] * points[i][1]
    29                      + gmatrix[1][2] * points[i][2];
    30        opoints[i][2] = gmatrix[2][0] * points[i][0]
    31                      + gmatrix[2][1] * points[i][1]
    32                      + gmatrix[2][2] * points[i][2];
    33        flags[i] = true;
    34      }
    35    }
    36  }
    37

GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
--- Comment #5 from irar at il dot ibm dot com  2007-01-07 07:40 ---
On the todo list.
BTW, vectorization of strided accesses was committed to the mainline 4.3.

Ira

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
--- Comment #4 from giovannibajo at libero dot it  2007-01-05 00:37 ---
Thanks Ira. What about stores with gaps?

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
--- Comment #3 from irar at il dot ibm dot com  2006-09-19 07:10 ---
> t.c:20: note: not vectorized: mixed data-types
> t.c:20: note: can't determine vectorization factor.
> Removing flags[i] = true;

Multiple data-types vectorization is already supported in the autovect
branch, and the patches for mainline (starting from
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00941.html) will be committed
as soon as 4.3 is open.

> we get:
> t.c:20: note: not consecutive access
> t.c:20: note: not vectorized: complicated access pattern.

Vectorization of strided accesses is also already implemented in the
autovect branch (and will be committed to the mainline 4.3). However, this
case contains stores with gaps (stores to opoints[i][0], opoints[i][1], and
opoints[i][2], without a store to opoints[i][3]), and only loads with gaps
are currently supported.

Therefore, this loop will be vectorizable in the autovect branch (and soon
in the mainline 4.3) if a store to opoints[i][3] is added.

Ira

-- 

irar at il dot ibm dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm dot com
   Last reconfirmed|2005-12-21 03:49:03         |2006-09-19 07:10:15
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
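For reference, a variant of the test case with the gap closed might look like the sketch below. It is an illustration only: the function name `RotateVectorsNoGap` is made up, the outer `r` loop, the alignment attributes and the `flags[i]` store are dropped for brevity, and the value stored to `opoints[i][3]` (a copy of `points[i][3]`) is an arbitrary choice. Whether a particular GCC version vectorizes it has not been checked here.

```c
#include <assert.h>

#define NUMPOINTS 5

static float points[NUMPOINTS][4];
static float opoints[NUMPOINTS][4];
static float gmatrix[3][3];

/* Same arithmetic as RotateVectors, plus a store to opoints[i][3]
   so that every element of each output row is written.  */
void
RotateVectorsNoGap (void)
{
  int i;

  assert (NUMPOINTS > 0);
  for (i = 0; i < NUMPOINTS; i++)
    {
      opoints[i][0] = gmatrix[0][0] * points[i][0]
                    + gmatrix[0][1] * points[i][1]
                    + gmatrix[0][2] * points[i][2];
      opoints[i][1] = gmatrix[1][0] * points[i][0]
                    + gmatrix[1][1] * points[i][1]
                    + gmatrix[1][2] * points[i][2];
      opoints[i][2] = gmatrix[2][0] * points[i][0]
                    + gmatrix[2][1] * points[i][1]
                    + gmatrix[2][2] * points[i][2];
      opoints[i][3] = points[i][3];   /* added store, closes the gap */
    }
}
```

Note that the added store changes what the function writes, so it is only a valid substitute when the fourth element of each output row is otherwise unused or is meant to be overwritten.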
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
--- Additional Comments From pinskia at gcc dot gnu dot org  2005-09-20 17:47 ---
t.c:20: note: not vectorized: mixed data-types
t.c:20: note: can't determine vectorization factor.

Removing flags[i] = true;
we get:
t.c:20: note: not consecutive access
t.c:20: note: not vectorized: complicated access pattern.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
[Bug tree-optimization/18438] vectorizer failed for vector matrix multiplication
--- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-12 02:43 ---
Confirmed, ICC can do this but does not because it is not very efficient
to do it.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|                            |1
   Last reconfirmed|-00-00 00:00:00             |2004-11-12 02:43:35
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438